Atom AI Labs - AI-Powered Multi-Tenant Platform

PostgreSQL Graph Architecture Proposal

1. Executive Summary

We propose shifting the GraphRAGEngine storage backend from **LanceDB (Edge List)** to **PostgreSQL (Relational Graph)**.

**Why?**

**Statelessness**: Allows utilizing atomic SQL queries (Recursive CTEs) for traversal without loading the entire graph into RAM.
**Infrastructure**: Reuses the existing PostgreSQL instance. No new infrastructure (Embedded Graph) or sticky volumes required.
**Reliability**: ACID compliance for graph updates.

2. Proposed Schema

We will introduce 3 normalized tables to represent the property graph structure.

`graph_nodes`

Represents entities (People, Projects, Documents).

CREATE TABLE graph_nodes (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    workspace_id VARCHAR(255) NOT NULL,
    name VARCHAR(512) NOT NULL,
    type VARCHAR(50) NOT NULL, -- e.g., 'person', 'task'
    description TEXT,
    properties JSONB DEFAULT '{}', -- Flexible metadata
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    
    -- Fast lookup
    CONSTRAINT uniq_node_workspace_name UNIQUE (workspace_id, name, type)
);
CREATE INDEX idx_nodes_workspace ON graph_nodes(workspace_id);

`graph_edges`

Represents relationships.

CREATE TABLE graph_edges (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    workspace_id VARCHAR(255) NOT NULL,
    source_node_id UUID REFERENCES graph_nodes(id) ON DELETE CASCADE,
    target_node_id UUID REFERENCES graph_nodes(id) ON DELETE CASCADE,
    relationship_type VARCHAR(50) NOT NULL, -- e.g., 'manages', 'blocks'
    weight FLOAT DEFAULT 1.0,
    properties JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_edges_source ON graph_edges(source_node_id);
CREATE INDEX idx_edges_target ON graph_edges(target_node_id);

`graph_communities` (For GraphRAG)

Stores the hierarchical clusters detected by the Leiden algorithm.

CREATE TABLE graph_communities (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    workspace_id VARCHAR(255) NOT NULL,
    level INT NOT NULL, -- Hierarchy level (0, 1, 2)
    summary TEXT NOT NULL, -- LLM-generated summary
    keywords JSONB, -- List of top keywords
    created_at TIMESTAMP DEFAULT NOW()
);

`community_membership`

Maps nodes to communities.

CREATE TABLE community_membership (
    community_id UUID REFERENCES graph_communities(id) ON DELETE CASCADE,
    node_id UUID REFERENCES graph_nodes(id) ON DELETE CASCADE,
    PRIMARY KEY (community_id, node_id)
);

3. Workflow & Logic

A. Ingestion (Write)

When a document is processed:

**Extract**: LLM extracts (Entity A, Relation, Entity B).
**Upsert Nodes**: Insert Entity A and Entity B into graph_nodes (ON CONFLICT DO NOTHING).
**Insert Edges**: Insert the relationship into graph_edges.

B. Retrieval (Read) - "Stateless Graph"

Instead of hydrate_workspace (loading 10k nodes into RAM), we use **Recursive SQL**.

**Local Search (Traversal):**

*Question*: "What does Sarah work on?"
*SQL*: Find Node "Sarah" -> JOIN graph_edges -> JOIN graph_nodes (Target).
*Depth*: Efficiently handled by Postgres for depths 1-3.

-- Example: Find everything Sarah touches (Depth 1)
SELECT target.name, edge.relationship_type
FROM graph_nodes source
JOIN graph_edges edge ON source.id = edge.source_node_id
JOIN graph_nodes target ON edge.target_node_id = target.id
WHERE source.name = 'Sarah' AND source.workspace_id = 'ws_123';

C. Community Detection (Async Job)

Since we can't run Leiden/Louvain efficiently in SQL:

**Job**: Periodically (e.g., hourly), a worker process fetches the full graph for a workspace.
**Compute**: Loads into NetworkX (RAM), runs Leiden algorithm.
**Summarize**: Generates summaries via LLM.
**Save**: Writes results to graph_communities table.
**Free RAM**: Process exits or clears memory.

4. Pros vs Cons

Feature	Postgres Architecture	LanceDB/In-Memory (Legacy)
Persistence	✅ Excellent (RDS)	❌ Poor (Volatile RAM)
Memory Usage	✅ Low (O(1) per query)	❌ High (O(N) per workspace)
Traversal Speed	⚠️ Medium (SQL Joins)	✅ Instant (In-Memory)
Complexity	✅ Low (Standard SQL)	⚠️ Medium (Custom Engine)
Vector Search	✅ Native (pgvector)	✅ Native (LanceDB)

*Note: We will KEEP LanceDB for pure text chunk vector search. Postgres is purely for the structural Metadata Graph.*

5. Migration Strategy

**Definition**: Create SQLAlchemy models in backend-saas/core/models.py.
**Migration**: Generate Alembic migration scripts.
**Refactor**: Update GraphRAGEngine to read/write from SQL models instead of internal dicts.